Applications in Plant Sciences — Latest Matching Preprints

1

AI-based prediction of herbarium sequencing success across the plant tree of life

Ranjbaran, Y.; Maurin, O.; Canadelli, E.; Morosinotto, T.; Weech, M.-H.; Kersey, P.; Antonelli, A.; Baker, W. J.; Sales, G. J.; Dal Grande, F.

2025-02-07 plant biology 10.1101/2025.02.03.636220 medRxiv

Top 0.1%

53.8%

DNA recovered from herbarium specimens represents a vital asset in botanical research, playing a pivotal role in unravelling the evolution, diversity, and ecological dynamics of plants. Despite its importance, challenges such as fragmented DNA and insufficient sequencing yields render molecular data retrieval a high-risk and costly endeavour involving the use of non-replaceable herbarium specimens. Here, we propose a framework based on Artificial Intelligence (AI) to forecast the success of genomic DNA extraction suitable for sequencing from herbarium samples. Our model integrates morphological characteristics and sample colour derived from scanned herbarium images, metadata including sample age and locality, and DNA quantity measurements of samples. We train a deep learning algorithm with ca. 2,000 specimens that have been digitized and sequenced in the framework of the Plant and Fungal Trees of Life (PAFTOL) Project, spanning from year 1832 to the present. As training datasets increase with ongoing digitization and genomic sequencing efforts, our AI predictive model can support researchers in selecting the herbarium samples with the highest likelihood of yielding high-quality genomic DNA from amongst a vast array of globally distributed candidate specimens. Our approach enhances the contribution of herbarium-derived DNA in large-scale studies and facilitates the utilisation of historical collections for a deeper understanding of plant evolution and ecology, with implications for conservation.

2

Generating, curating, and evaluating trnL reference sequence databases: Benchmarking OBITools3/ecoPCR, RESCRIPt, and MetaCurator

Kuddar, O. S.; Meiklejohn, K. A.; Callahan, B. J.

2026-04-10 bioinformatics 10.64898/2026.04.07.717010 medRxiv

Top 0.1%

52.1%

Plant DNA metabarcoding enables the identification of plant taxa in mixed samples, with the trnL (UAA) intron and its P6 loop mini-barcode region performing as well as or better than other commonly used markers. Reliable metabarcoding requires high-quality reference databases, yet a regularly maintained trnL resource is currently lacking. Consequently, most studies use uncurated sequences downloaded directly from public repositories without essential validation. We address these gaps by providing guidance through a systematic comparison of three database curation tools (OBITools3/ecoPCR, RESCRIPt, and MetaCurator) to generate three trnL reference sequence databases and evaluate their classification performance across commonly sequenced trnL regions (CD, CH, and GH). Reference trnL sequences and taxonomy files were retrieved from public sequence repositories and curated using standardized filtering steps to reduce taxonomic errors, sequence ambiguity, and redundancy. Four simulated query datasets (two base sets and their mutated counterparts) were constructed to assess classification performance of the databases using the Naive Bayesian Classifier implemented in DADA2. The evaluation showed that performance differed by trnL region: MetaCurator and RESCRIPt yielded higher and similar metrics for trnL CD; OBITools3/ecoPCR and RESCRIPt were comparable for trnL CH; and MetaCurator attained the highest performance for the trnL GH region. All reference databases, taxonomy, and evaluation files are available at Zenodo (https://doi.org/10.5281/zenodo.17969450). The complete computational workflow and scripts are available on GitHub (https://github.com/oskuddar/trnL_DB). Although evaluation was focused on plant taxa in the United States, the resulting databases are suitable for use as global trnL reference databases.

3

SPrOUT: A computational and targeted sequencing approach for mixed plant DNA identification with Angiosperms353

Hu, N.; Bullock, M. R.; Jackson, C.; Miller, C.; Hunter, E.; Huff, C.; Chen, Y.; Handy, S.; Johnson, M.

2026-02-23 bioinformatics 10.64898/2026.02.20.707031 medRxiv

Top 0.1%

46.6%

Premise: The identification of plant species from mixed samples is crucial in various fields, including ecological surveys, conservation efforts, and food and dietary supplement safety. Traditional methods face potential challenges due to the high costs of DNA sequencing, inefficiencies in computational workflows, and incomplete sequence databases. Methods and Results: This study introduces a novel approach using the Angiosperms353 target sequencing kit for efficient taxonomic identification of angiosperm DNA in mixed samples. Our method assembles short paired-end reads for each mixed sample. Using gene sets of Angiosperms353 from 871 species, we apply phylogenetic inference to categorize the variance in phylogenetic distance across genes to identify the presence of taxa in mixed plant samples. The pipeline reaches 98.1 to 99.6% accuracy, 92.9 to 100% precision for identifying unknown taxa in in-silico mixes, and 90.7% accuracy and 98.0% precision for mock supplement mixtures. We explored the parameter cutoffs of the pipeline to offer an empirical range for different applications. Conclusions: The Angiosperms353 and HybPiper assembly proved effective in sorting mixed plant DNA samples. Our method offers a framework for scientific and practical applications in plant species identification in both single and mixed samples.

4

Turning a new leaf: PhenoVision provides leaf phenology data at the global scale

Grady, E. L.; Denny, E. G.; Seltzer, C. E.; Deck, J.; Li, D.; Dinnage, R.; Guralnick, R. P.

2025-09-29 ecology 10.1101/2025.09.26.678778 medRxiv

Top 0.1%

46.0%

Plant phenology dictates many aspects of community function and ecosystem dynamics. Yet, global phenology data are still limited, especially in areas lacking monitoring programs. Here we present a new data resource, PhenoVision-Leaf, which extends a computer-vision pipeline utilizing iNaturalist digital image vouchers to produce global-scale leaf phenophase data for deciduous, woody genera. We first discuss our implementation of a new human annotation framework for leaf phenology on iNaturalist, aligning with phenophase definitions used by the larger phenology community. We then showcase the use of 165,988 crowdsourced annotated records to train a Vision Transformer model with a two-stage regime to maximize accuracy across single- and multi-image records. This approach extends PhenoVision from scoring individual images to aggregating at the iNaturalist record level, better aligning with human annotation processes. Post-hoc validation showed high performance for detecting present green and colored leaves (>98% accuracy), and reasonable accuracy for breaking leaf buds (>87% accuracy). Applying PhenoVision-Leaf to over 26 million iNaturalist records yielded 5.6 million record-level phenology observations across 6,500 species and 57 families, filling geographic and taxonomic gaps. These data, now accessible through the Phenobase portal, establish a foundation for near real-time monitoring of leaf phenology, supporting global-scale synthesis analyses.

5

High-throughput iNaturalist image analysis reveals flower color divergence in Monarda fistulosa

McKenzie, P. F.; Church, S. H.; Hopkins, R.

2025-05-26 evolutionary biology 10.1101/2025.05.21.655392 medRxiv

Top 0.1%

45.5%

Characterizing patterns of trait variation across widespread species is a fundamental goal of natural history. Here we create a pipeline to analyze a large community science dataset and test hypothesized flower color divergence across the range of a widespread wildflower. Monarda fistulosa is a North American perennial that produces showy lavender inflorescences. Although previous literature suggests that the flowers of western M. fistulosa might display a deeper purple color than the eastern varieties, this divergence has not been assessed at scale. We process over 40,000 community science photographs of M. fistulosa to identify flowers and extract color. We demonstrate that the flowers of the montane western variety have lower lightness and higher chroma, corresponding to a deeper violet color, than those of eastern M. fistulosa. Our approach and validation provide a scalable framework for phenotyping community science images and enable analysis of geographic color variation in other widespread species.

6

A massive community-science dataset reveals convergent evolution of delayed flowering phenology in North American red-flowering plants

McKenzie, P. F.; Berardi, A. E.; Hopkins, R.

2024-10-22 evolutionary biology 10.1101/2024.09.25.614826 medRxiv

Top 0.1%

44.9%

The radiation of angiosperms is marked by a phenomenal diversity of floral size, shape, color, scent, and reward. Through hundreds of years of documentation and quantification, scientists have sought to make sense of this variation by defining pollination syndromes. These syndromes are the convergent evolution of common suites of floral traits across distantly related species that have evolved by selection to optimize pollination strategies. The availability of community-science datasets provides an opportunity to develop new tools and to examine new traits that may help further characterize broad patterns of flowering plant diversity. Here we test the hypothesis that flowering phenology can also be a pollination syndrome trait. We generate a novel flower color dataset by using GPT-4 with Vision (GPT-4V) to assign flower color to 11,729 North American species. We map these colors to 1,674,908 community-scientist observations of flowering plants to investigate patterns of phenology. We demonstrate constrained flowering time in the eastern United States for plants with red or orange flowers relative to plants with flowers of other colors. Red- and orange-colored flowers are often characteristic of the "hummingbird" pollination syndrome; importantly, the onset of red and orange flowers corresponds to the arrival of migratory hummingbirds. Our results suggest that the hummingbird pollination syndrome can include flowering phenology and reveal an opportunity to expand the suite of traits included in pollination syndromes. Our methods demonstrate an effective pipeline for leveraging enormous amounts of community science data by using artificial intelligence to extract information about patterns of trait variation.

7

Procrustean pseudo-landmark methods in Python to measure massive quantities of leaf shape data

Hightower, A. T.; Hall, S.; Camacho, R. U.; Papamichail, A.; Adamski, E.; Colligan, C.; Deneen, A.; Dunn, G.; Haziza, J.; Henley, C.; Pawawongsak, A.; Simms, L.; Ward, S.; Balant, M.; Blackwood, C.; Cannon, C.; Case, A.; Husbands, A.; Josephs, E. H.; Migicovsky, Z.; Naegele, R.; Patterson, E.; Saavedra-Rojas, Y.-A.; Chitwood, D. H.

2025-08-09 plant biology 10.1101/2025.08.08.669192 medRxiv

Top 0.1%

44.6%

Premise: When examining leaf shapes that are different from one another, it can be difficult to compare both the overall leaf shape and points along the leaf margin in biologically and statistically meaningful ways. Method: To address this problem, we present a simple and user-friendly leaf shape analysis in Jupyter Notebook and Python that uses pseudo-landmarks and Generalized Procrustes Analysis to measure and compare the shape of any leaf. To demonstrate our analysis, we created a repository of real leaves gathered from eight experimental datasets. Results: Using our leaf repository, we explain how we can use pseudo-landmarks to compare all leaf shapes both within and between species using dimension reduction techniques like Principal Component Analysis and can predict leaf shapes using pseudo-landmarks through Linear Discriminant Analysis. Our leaf shape analysis also maps differences in shape as leaves grew around a rosette, showing the transition of shape across development (phyllotaxy). Finally, we showed how we can investigate the relationship between leaf shape variation and genetic diversity by combining shape with genetic data. Discussion: Through the use of Generalized Procrustes Analysis and pseudo-landmarks, our leaf shape analysis presents a powerful tool for examining the shape of any leaf across multiple biological, ecological, evolutionary, and developmental scales.
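For readers unfamiliar with the technique named in this abstract, the following is a minimal sketch of pseudo-landmark placement followed by Generalized Procrustes Analysis. It is not the authors' notebook: the function names (`resample_outline`, `normalise`, `align`, `gpa`) are hypothetical, outlines are assumed to be closed 2D polygons whose pseudo-landmarks start at a comparable point, and the alignment step permits reflections for brevity.

```python
import numpy as np

def resample_outline(points, n):
    """Place n pseudo-landmarks at equal arc-length steps along a closed outline."""
    pts = np.asarray(points, dtype=float)
    closed = np.vstack([pts, pts[:1]])                 # repeat first point to close the loop
    seg = np.linalg.norm(np.diff(closed, axis=0), axis=1)
    cum = np.concatenate([[0.0], np.cumsum(seg)])      # cumulative arc length
    targets = np.linspace(0.0, cum[-1], n, endpoint=False)
    x = np.interp(targets, cum, closed[:, 0])
    y = np.interp(targets, cum, closed[:, 1])
    return np.column_stack([x, y])

def normalise(shape):
    """Remove translation (centre on the centroid) and scale (unit centroid size)."""
    centred = shape - shape.mean(axis=0)
    return centred / np.linalg.norm(centred)

def align(shape, reference):
    """Orthogonal Procrustes: best orthogonal map of shape onto reference.
    Assumption: reflections are allowed; a rotation-only variant would check the determinant."""
    u, _, vt = np.linalg.svd(shape.T @ reference)
    return shape @ (u @ vt)

def gpa(shapes, iterations=10, tol=1e-10):
    """Iteratively align all shapes to their evolving mean shape."""
    aligned = [normalise(s) for s in shapes]
    mean = aligned[0]
    for _ in range(iterations):
        aligned = [align(s, mean) for s in aligned]
        new_mean = normalise(np.mean(aligned, axis=0))
        converged = np.linalg.norm(new_mean - mean) < tol
        mean = new_mean
        if converged:
            break
    return np.array(aligned), mean
```

The aligned coordinates can then be flattened to one row per leaf and passed to a dimension reduction method such as Principal Component Analysis, which is the kind of downstream step the abstract describes.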

8

Compositae-ParaLoss-1272: Complementary sunflower specific probe-set reduces issues with paralogs in complex systems

Moore-Pollard, E. R.; Jones, D. S.; Mandel, J. R.

2023-07-21 plant biology 10.1101/2023.07.19.549085 medRxiv

Top 0.1%

42.1%

Premise: The sunflower family specific probe set, Compositae-1061, has enabled family-wide phylogenomic studies and investigations at lower-taxonomic levels by targeting 1,000+ genes. However, it generally lacks resolution at the genus to species level, especially in groups with complex evolutionary histories including polyploidy and hybridization. Methods: In this study, we developed a new Hyb-Seq probe set, Compositae-ParaLoss-1272, designed to target orthologous loci in Asteraceae family members. We tested its efficiency across the family by simulating target-enrichment sequencing in silico. Additionally, we tested its effectiveness at lower taxonomic levels in genus Packera which has a complex evolutionary and taxonomic history. We performed Hyb-Seq with Compositae-ParaLoss-1272 for 19 taxa which were previously studied using the Compositae-1061 probe set. Sequences from both probe sets were used to generate phylogenies, compare topologies, and assess node support. Results: We report that Compositae-ParaLoss-1272 captured loci across all tested Asteraceae members. Additionally, Compositae-ParaLoss-1272 had less gene tree discordance, recovered considerably fewer paralogous sequences, and retained longer loci than Compositae-1061. Discussion: Given the complexity of plant evolutionary histories, assigning orthology for phylogenomic analyses will continue to be challenging. However, we anticipate this new probe set will provide improved resolution and utility for studies at lower-taxonomic levels and complex groups in the sunflower family.

9

Making the most out of it: shallow genome-skimming possibilities for the systematics of prickly lineages of Solanum (Solanaceae)

Alves, R. T. d. L.; Gouvea, Y. F.; Dalapicolla, J.; Poczai, P.; Giacomin, L. L.

2026-07-09 plant biology 10.64898/2026.07.08.737304 medRxiv

Top 0.1%

37.9%

Premise: Genome skimming (GS) is a cost-effective approach for plant phylogenomics, but its ability to recover informative datasets from different genomic compartments, particularly genome-wide SNPs, remains poorly explored in Solanum. Methods: We evaluated shallow GS for phylogenetic inference in South American prickly Solanum lineages by recovering plastid, mitochondrial, and nuclear datasets, including coding regions and genome-wide SNPs. Phylogenies were inferred using maximum-likelihood and coalescent approaches under different SNP filtering strategies. Results: GS successfully recovered complete plastomes, organellar coding regions, and large SNP datasets, but failed to consistently assemble mitochondrial genomes or recover low-copy nuclear genes. SNP-based analyses, especially from the nuclear genome, produced stable, well-supported phylogenies that were largely congruent across inference methods. In contrast, coding-region datasets, particularly from the mitochondrial genome, showed greater topological discordance, revealing cytonuclear conflict. Discussion: Our results demonstrate that shallow GS is an effective strategy for generating informative SNP datasets for phylogenetic inference in Solanum, despite limitations in recovering complete mitochondrial genomes and low-copy nuclear loci. SNP-based analyses substantially expand the phylogenetic potential of GS, providing a practical and cost-effective alternative for systematic studies.

10

hybpiper-rbgv and yang-and-smith-rbgv: Containerization and additional options for assembly and paralog detection in target enrichment data

Jackson, C.; McLay, T.; Schmidt-Lebuhn, A. N.

2021-11-10 bioinformatics 10.1101/2021.11.08.467817 medRxiv

Top 0.1%

37.9%

Premise: The HybPiper pipeline has become one of the most widely used tools for the assembly of target enrichment (sequence capture) data for phylogenomic analysis. Between the production of locus sequences and phylogenetic analysis, the identification of paralogs is a critical step ensuring accurate inference of evolutionary relationships. Algorithmic approaches using gene tree topologies for the inference of ortholog groups are computationally efficient and broadly applicable to non-model organisms, especially in the absence of a known species tree. Unfortunately, software compatibility issues, unfamiliarity with relevant programming languages, and the complexity involved in running numerous subsequent analysis steps continue to limit the broad uptake of these approaches and constrain their application in practice. Methods and Results: We updated the scripts constituting HybPiper and a pipeline for the inference of ortholog groups ("Yang and Smith") to provide novel options for the treatment of supercontigs, remove bugs, and seamlessly use the outputs of the former as inputs for the latter. The pipelines were containerised using Singularity and implemented via two Nextflow pipelines for easier deployment and to vastly reduce the number of commands required for their use. We tested the pipelines with several datasets, one of which is presented for demonstration. Conclusions: hybpiper-rbgv and yang-and-smith-rbgv provide easy installation, user-friendly experience, and robust results to the phylogenetic community. They are presently used as the analysis pipeline of the Australian Angiosperm Tree of Life project. The pipelines are available at https://github.com/chrisjackson-pellicle.

11

A workflow for practical training in ecological genomics using Oxford Nanopore long-read sequencing

Foster, R.; De Weerd, H.; Medd, N. C.; Booth, T.; Newman, C.; Ritch, H.; Santoyo-Lopez, J.; Trivedi, U.; Twyford, A. D.

2024-09-04 genomics 10.1101/2024.09.03.610948 medRxiv

Top 0.1%

36.4%

Long-read single molecule sequencing technologies continue to grow in popularity for genome assembly and provide an effective way to resolve large and complex genomic variants. However, uptake of these technologies for teaching and training is hampered by the complexity of high molecular weight DNA extraction protocols, the time required for library preparation and the costs for sequencing, as well as challenges with downstream data analyses. Here, we present a full long-read workflow optimised for teaching that covers each stage, from DNA extraction, to library preparation and sequencing, to data QC and genome assembly and characterisation, and that can be completed in under two weeks. We use a specific case study of plant identification, where students identify an anonymous plant sample by sequencing and assembling the genome and comparing it to other samples and to reference databases. In testing, long-read genome skimming of nine wild-collected plant species extracted with a modified kit-based approach produced an average of 8 Gb of Oxford Nanopore data, enabling the complete assembly of plastid genomes, and partial assembly of nuclear genomes. In the classroom, all students were able to complete the protocols, and to correctly identify their plant samples based on BOLD searches of barcoding loci extracted from the plastid genome, coupled with phylogenetic analyses of whole plastid genomes. We supply all the learning material and raw data allowing this to be adapted to a range of teaching settings.

12

Progress Towards Plant Community Transcriptomics: Pilot RNA-Seq Data from 24 Species of Vascular Plants at Harvard Forest

Marx, H. E.; Jorgensen, S. A.; Wisely, E.; Li, Z.; Dlugosch, K. M.; Barker, M. S.

2020-04-01 genomics 10.1101/2020.03.31.018945 medRxiv

Top 0.1%

33.5%

Premise of the study: Large scale projects such as NEON are collecting ecological data on entire biomes to track and understand plant responses to climate change. NEON provides an opportunity for researchers to launch community transcriptomic projects that ask integrative questions in ecology and evolution. We conducted a pilot study to investigate the challenges of collecting RNA-seq data from phylogenetically diverse NEON plant communities, including species with diploid and polyploid genomes. Methods: We used Illumina NextSeq to generate >20 Gb of RNA-seq for each of 24 vascular plant species representing 12 genera and 9 families at the Harvard Forest NEON site. Each species was sampled twice, in July and August 2016. We used Transrate, BUSCO, and GO analyses to assess transcriptome quality and content. Results: We obtained nearly 650 Gb of RNA-seq data that assembled into more than 755,000 translated protein sequences across the 24 species. We observed only modest differences in assembly quality scores across a range of k-mer values. On average, transcriptomes contained hits to >70% of loci in the BUSCO database. We found no significant difference in the number of assembled and annotated genes between diploid and polyploid transcriptomes. Discussion: Our resource provides new RNA-seq datasets for 24 species of vascular plants in Harvard Forest. Challenges associated with this type of study included recovery of high quality RNA from diverse species and access to NEON sites for genomic sampling. Overcoming these challenges offers clear opportunities for large scale studies at the intersection of ecology and genomics.

13

Discrimination of Annonaceae using herbarium leaf reflectance spectra under limited sample size conditions

Boughalmi, K.; Santacruz Endara, P. G.; Bennett, L. A.; Ecarnot, M.; Bazan, S.; Bastianelli, D.; Bonnal, L.; Couvreur, T. L. P.

2025-09-05 plant biology 10.1101/2025.09.02.673631 medRxiv

Top 0.1%

26.2%

Premise: Herbarium collections offer an unparalleled archive of plant biodiversity, but their use for species identification through spectral data remains constrained by uncertain effects of preservation histories. This study assesses whether herbarium specimens can reliably predict species based on their leaf reflectance spectra, despite variations in age, geographic origin, or conservation method under limited sample size conditions. Methods: We scanned herbarium specimens of different ages and geographic distribution of 14 species of the pantropical Annonaceae. In addition, we used a second dataset of 9 species where some specimens were conserved in alcohol prior to drying and some not. We used five supervised classification models frequently used for high-dimensional data such as spectroscopy. Results: All models achieved high accuracy (>80%) when trained on multiple specimens per species. However, when using only one specimen per species, accuracy varied substantially depending on the taxon. Discussion: Our findings demonstrate that herbarium specimens often retain a strong taxonomic signal in their spectra; however, inter-individual variability affects accuracy in some taxa. These findings confirm the usefulness of herbarium spectroscopy as a non-destructive tool for species identification and offer a promising avenue for digitizing historical biodiversity data into high-dimensional trait space.

14

De novo homology assessment from landmark data: A workflow to identify and track segmented structures in plant time series images

Hodge, J. G.; Li, Q.; Doust, A.

2021-02-21 plant biology 10.1101/2021.02.21.432162 medRxiv

Top 0.1%

22.5%

Assessing the phenotypes underlying plant growth and development is integral to exploring the development, genetics, and evolution of morphology and plays an essential role in agronomic and basic research studies. Although various automated or semi-automated phenomic approaches have recently been developed, assessing differential growth of plant organs remains a key topic of interest, but one that is often difficult to analyze due to the requirements of segmenting and annotating specific structures or positions in the plant body in time-series data. To address this gap, we have developed a generalized workflow linking our previously published function, acute, with a companion function, homology, in the PlantCV environment. The homology function uses a generalized strategy of dimensionality reduction via starscape followed by hierarchical clustering through constella to identify constellations of segments in eigenspace that represent the same landmark in consecutive images of a time-series. We devised a quality control function, constellaQC, that can test the accuracy of the clustering approach, and we use it to show that the approach accurately clustered the pseudo-landmarks derived from acute, although with several sources of error. We discuss the reasons for and consequences of these errors in automated workflows, and suggest how to develop these functions so that they can easily be repurposed for other phenomics datasets that may vary in dimensional complexity.

15

The Case for Retaining Natural Language Descriptions of Phenotypes in Plant Databases and a Web Application as Proof of Concept

Braun, I. R.; Bassham, D. C.; Lawrence-Dill, C. J.

2021-02-06 bioinformatics 10.1101/2021.02.04.429796 medRxiv

Top 0.1%

21.8%

Similarities in phenotypic descriptions can be indicative of shared genetics, metabolism, and stress responses, to name a few. Finding and measuring similarity across descriptions of phenotype is not straightforward, with previous successes in computation requiring a great deal of expert data curation. Natural language processing of free text descriptions of phenotype is often less resource intensive than applying expert curation. It is therefore critical to understand the performance of natural language processing techniques for organizing and analyzing biological datasets and for enabling biological discovery. For predicting similar phenotypes, a wide variety of approaches from the natural language processing domain perform as well as curation-based methods. These computational approaches also show promise both for helping curators organize and work with large datasets and for enabling researchers to explore relationships among available phenotype descriptions. Here we generate networks of phenotype similarity and share a web application for querying a dataset of associated plant genes using these text mining approaches. Example situations and species for which application of these techniques is most useful are discussed. Database URLs: The database and analytical tool called QuOATS are available at https://quoats.dill-picl.org/. Code for the web application is available at https://git.io/Jtv9J. Datasets are available for direct access via https://zenodo.org/record/7947342#.ZGwAKOzMK3I. The code for the analyses performed for the publication is available at https://github.com/Dill-PICL/Plant-data and https://github.com/Dill-PICL/NLP-Plant-Phenotypes.

16

Harmonising digitised herbarium data to enhance biodiversity knowledge: creating an updated checklist for the flora of Greenland.

Whitley, B. S.; Abermann, J.; Alsos, I. G.; Biersma, E. M.; Gardman, V.; Hoye, T. T.; Jones, L.; Khelidj, N. M.; Li, Z.; Losapio, G.; Pape, T.; Raundrup, K.; Schmitz, P.; Silva, T.; Wirta, H.; Roslin, T.; Ahlstrand, N. I.; de Vere, N.

2024-12-05 ecology 10.1101/2024.12.01.626242 medRxiv

Top 0.1%

21.6%

International efforts to digitise herbarium specimens provide the building blocks for a global digital herbarium. However, taxonomic changes and errors can result in inconsistencies when amalgamating specimen metadata that compromise the assignment of occurrence records to correct taxa, and the subsequent interpretation of patterns in biodiversity. We present a novel workflow to mass-curate digital specimens. By employing existing digital taxonomic backbones, we aggregate specimen names by their accepted name and flag remaining cases for manual review. We then validate names using site-specific floras, balancing automation with taxonomic expert-based curation. Applying our workflow to the vascular plants of Greenland, we harmonised 175,266 digitised herbarium specimens and observations from 92 data providers from the Global Biodiversity Information Facility (GBIF). The harmonised metacollection for the Greenland flora contains 780 plant species. Our workflow increases the number of species known from Greenland compared to other currently available species checklists and increases the mean number of occurrences per species by 42.6. Our workflow illustrates the integration required in order to create a global, universally accessible digital herbarium, and shows how previous obstacles to database curation can be overcome through a combination of automation and expert curation. From the specific perspective of the Greenland flora, our approach arrives at a new checklist of taxa, a new curated metacollection of occurrence data, and revised estimates of plant richness. The list of taxa and their prevalence allow a new basis for biodiversity assessment and conservation planning. Societal Impact Statement: Digitising plant collections has allowed for data to be aggregated across multiple collections, forming a single harmonised resource of unprecedented scale. This resource is only accurate once the database names are assigned to one accepted name per species. 
We established a semi-automated workflow for processing plant name data, leveraging taxonomic backbones and employing taxonomic expertise at key stages. Applying our workflow to the flora of Greenland, we developed a curated checklist of 780 species, capturing greater species richness than previously published, while also curating 175,266 plant records. Our findings redefine our knowledge of Greenlandic plant diversity, while harmonising a vast digital collection for further research.
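The aggregation step described in this abstract (grouping specimen names under their accepted name and flagging the rest for manual review) can be sketched as follows. This is an illustrative assumption, not the authors' workflow: the `harmonise` function, the record layout, and the `backbone` dictionary standing in for a taxonomic backbone lookup are all hypothetical, and the taxon names are placeholders rather than real synonymy.

```python
from collections import defaultdict

def harmonise(records, backbone):
    """Group records by accepted name; collect unresolved names for manual review.

    records:  iterable of dicts with a "name" key (hypothetical layout)
    backbone: dict mapping any known name (accepted or synonym) to its accepted name
    """
    grouped = defaultdict(list)
    needs_review = []
    for rec in records:
        name = " ".join(rec["name"].split())   # collapse stray whitespace
        accepted = backbone.get(name)
        if accepted is None:
            needs_review.append(rec)           # left for a taxonomic expert
        else:
            grouped[accepted].append(rec)
    return dict(grouped), needs_review

# Placeholder names only; a real backbone would come from a taxonomic service.
backbone = {
    "Genus alpha": "Genus alpha",
    "Genus alfa": "Genus alpha",               # synonym resolved to the accepted name
    "Genus beta": "Genus beta",
}
records = [
    {"id": 1, "name": "Genus alpha"},
    {"id": 2, "name": "Genus  alfa"},
    {"id": 3, "name": "Genus beta"},
    {"id": 4, "name": "Genus gamma"},          # absent from the backbone
]
grouped, needs_review = harmonise(records, backbone)
```

Keeping the unresolved records in a separate list, rather than dropping them, mirrors the abstract's emphasis on combining automation with expert curation.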

17

A novel phylogenomics pipeline reveals complex pattern of reticulate evolution in Cucurbitales

Ortiz, E. M.; Hoewener, A.; Shigita, G.; Raza, M.; Maurin, O.; Zuntini, A.; Forest, F.; Baker, W. J.; Schaefer, H.

2023-11-01 bioinformatics 10.1101/2023.10.27.564367 medRxiv

Top 0.1%

21.3%

A diverse range of high-throughput sequencing data, such as target capture, RNA-Seq, genome skimming, and high-depth whole genome sequencing, are used for phylogenomic analyses but the integration of such mixed data types into a single phylogenomic dataset requires a number of bioinformatic tools and significant computational resources. Here, we present a novel pipeline, Captus, to analyze mixed data in a fast and efficient way. Captus assembles these data types, allows searching of the assemblies for loci of interest, and finally produces alignments filtered for paralogs. If reference target loci are not available for the studied taxon, Captus can also be used to discover new putative homologs via sequence clustering. Compared to other software, Captus allows the recovery of a greater number of more complete loci across a larger number of species. We apply Captus to assemble a comprehensive mixed dataset, comprising the four types of sequencing data for the angiosperm order Cucurbitales, a clade of about 3,100 species in eight mainly tropical plant families, including begonias (Begoniaceae) and gourds (Cucurbitaceae). Our phylogenomic results support the currently accepted circumscription of Cucurbitales except for the position of the holoparasitic Apodanthaceae, which group with Rafflesiaceae in Malpighiales. A subset of mitochondrial gene regions supports the earlier position of Apodanthaceae in Cucurbitales. However, the nuclear regions and majority of mitochondrial regions place Apodanthaceae in Malpighiales. Within Cucurbitaceae, we confirm the monophyly of all currently accepted tribes but also reveal deep reticulation patterns both in Cucurbitales and within Cucurbitaceae. We show that contradicting results among earlier phylogenetic studies in Cucurbitales can be reconciled when accounting for gene tree conflict and demonstrate the efficiency of Captus for complex datasets.

18

PURC v2.0: a program for improved sequence inference for polyploid phylogenetics and other manifestations of the multiple-copy problem

Schafran, P. W.; Li, F.-W. W.; Rothfels, C.

2021-11-19 bioinformatics 10.1101/2021.11.18.468666 medRxiv

Top 0.1%

19.3%

Inferring the true biological sequences from amplicon mixtures remains a difficult bioinformatic problem. The traditional approach is to cluster sequencing reads by similarity thresholds and treat the consensus sequence of each cluster as an "operational taxonomic unit" (OTU). Recently, this approach has been improved upon by model-based methods that correct PCR and sequencing errors in order to infer "amplicon sequence variants" (ASVs). To date, ASV approaches have been used primarily in metagenomics, but they are also useful for identifying allelic or paralogous variants and for determining homeologs in polyploid organisms. To facilitate the usage of ASV methods among polyploidy researchers, we incorporated ASV inference alongside OTU clustering in PURC v2.0, a major update to PURC (Pipeline for Untangling Reticulate Complexes). In addition to preserving original PURC functions, PURC v2.0 allows users to process PacBio CCS/HiFi reads through DADA2 to generate and annotate ASVs for multiplexed data, with outputs including separate alignments for each locus ready for phylogenetic inference. In addition, PURC v2.0 features faster demultiplexing than the original version and has been updated to be compatible with Python 3. In this chapter we present results indicating that PURC v2.0 (using the ASV approach) is more likely to infer the correct biological sequences in comparison to the earlier OTU-based PURC, and describe how to prepare sequencing data, run PURC v2.0 under several different modes, and interpret the output. We expect that PURC v2.0 will provide biologists with a method for generating multi-locus "moderate data" datasets that are large enough to be phylogenetically informative and small enough for manual curation.

19

A hybrid capture RNA bait set for resolving genetic and evolutionary relationships in angiosperms from deep phylogeny to intraspecific lineage hybridization

Waycott, M.; van Dijk, J. K.; Biffin, E.

2021-09-07 evolutionary biology 10.1101/2021.09.06.456727 medRxiv

Top 0.1%

19.2%

Novel multi-gene targeted capture probes have been developed with the objective of obtaining multi-locus, high-quality sequence reads across any angiosperm lineage. Using existing genomic and transcriptomic data, two independent single-assay probe/bait sets have been developed, the first targeting conserved exons from 20 low-copy nuclear genes (OzBaits_NR V1.0) and the second targeting 19 plastid gene regions (OzBaits_CP V1.0). These universal bait sets can efficiently generate DNA sequence data that are suitable for systematics and evolutionary studies of flowering plants. The bait sets can be ordered as Daicel-Arbor Sciences custom myBaits. We demonstrate the utility of the bait sets in consistently recovering the targeted genomic regions across an evolutionarily broad range of angiosperm taxa.

20

Genebank genomics allows greatly improved taxonomic correction for Capsicum spp. accessions using a novel automated classification method

Rabanus-Wallace, M. T.; Stein, N.

2022-11-10 bioinformatics 10.1101/2022.11.09.515845 medRxiv

Top 0.1%

18.7%

To maximise the benefit of exploiting genebank resources, accurate and complete taxonomic assignments are imperative. The rise of genebank genomics allows genetic methods to be used for this task. These methods need to be largely automated, since the number of samples is too great for efficient manual recategorisation, yet no clearly optimal method has arisen. A recent landmark genebank genomic study sequenced over 10,000 accessions of peppers (Capsicum spp.), for which the exploitation of genebank material is of huge commercial, cultural, and scientific importance. This study resulted in precisely the type of dataset that will, in coming decades, likely be produced for hundreds of plant taxa. The long-appreciated difficulties of pepper taxonomy are evident from the many obvious misclassifications noted in this and other studies, providing a perfect opportunity to simultaneously advance methods development in the area, to correct many genebank taxonomic assignments of pepper accessions, and to provide insights into pepper taxonomy in general. This paper aims to achieve these goals using an approach that combines several ideas from standard classification algorithms to create a highly flexible and customisable classifier that performs favourably when compared with key alternative methods. The various characteristics of different methods are discussed, and possible sensible alterations to pepper taxonomy based on the results are proposed for discussion by the community.